chore: Make graph visualization more scalable #208
ericfe-google wants to merge 3 commits into main
Conversation
Summary of Changes: Hello @ericfe-google, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the scalability of the graph visualization feature by introducing intelligent data handling based on query result size. It avoids embedding excessively large datasets directly into the HTML, opting instead for a callback mechanism that fetches data from BigQuery for medium-sized results, and it imposes a hard limit for extremely large ones. This change improves performance and resource usage, making the visualization more robust across diverse data volumes.
Code Review
This pull request introduces a more scalable approach for graph visualization by conditionally embedding query results in the HTML based on their size. For larger results, it fetches them on-demand from a temporary BigQuery table. The changes include refactoring client creation logic for better code organization and adding new size estimation functions. My review focuses on a bug in the size estimation logic and a maintainability improvement in the server-side handling of on-demand data fetching. The added tests are comprehensive and cover the new functionality well.
Force-pushed from f7d3ba0 to 88ab993.
We now only include query results directly in the HTML when they are smaller than 100 KB. For larger query results, we store only a reference to the destination table in the HTML, and the Python code re-reads the query results from the destination table during the callback. We also added a hard limit of 5 MB on the query result size, beyond which graph visualization is not supported at all.
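As a rough illustration of that dispatch, here is a minimal sketch; the two size constants come from the diff below, but the `_render_*` helpers and the surrounding function are hypothetical placeholders, not this PR's actual code:

```python
# Illustrative sketch only: the _render_* helpers are made-up names standing
# in for the PR's real rendering paths.
def _choose_graph_payload(df, query_job):
    size = _estimate_json_size(df)
    if size > MAX_GRAPH_VISUALIZATION_SIZE:
        # Hard 5 MB cap: visualization is not supported at all.
        raise ValueError("Query result too large to visualize as a graph.")
    if size > MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE:
        # Medium results: embed only a reference to the destination table;
        # the Python callback re-reads the rows from BigQuery on demand.
        return _render_with_table_reference(query_job.destination)
    # Small results (under 100 KB): embed the rows directly in the HTML.
    return _render_with_inline_rows(df)
```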
```python
client_options=bigquery_client_options,
location=args.location,
bq_client = core.create_bq_client(
    args.project, args.bigquery_api_endpoint, args.location
```
Nit: since these are all string arguments (not that we've enabled type checking, anyway), passing by keyword could prevent accidentally passing the wrong value to the wrong parameter.
```diff
-    args.project, args.bigquery_api_endpoint, args.location
+    project=args.project,
+    bigquery_api_endpoint=args.bigquery_api_endpoint,
+    location=args.location,
```
| return " ".join(identities) | ||
|
|
||
|
|
||
| def create_bq_client(project: str, bigquery_api_endpoint: str, location: str): |
Even more optional: we could force these to be keyword arguments. That's a practice the Vertex team follows, which is helpful because it means we could theoretically reorder the arguments without breaking changes. I'm of mixed opinion about forcing that on users, but I do think it's a useful practice for internal functions like this.
```diff
-def create_bq_client(project: str, bigquery_api_endpoint: str, location: str):
+def create_bq_client(*, project: str, bigquery_api_endpoint: str, location: str):
```
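For illustration, with the keyword-only signature a positional call fails fast; the sketch below assumes the `core` module from this PR is importable, and the literal argument values are made-up placeholders:

```python
# With the keyword-only signature, a positional call raises immediately:
#   core.create_bq_client("my-project", "https://bigquery.googleapis.com", "US")
#   TypeError: create_bq_client() takes 0 positional arguments but 3 were given

# Callers must name every argument, so the parameters can later be reordered
# without breaking existing call sites:
bq_client = core.create_bq_client(
    project="my-project",
    bigquery_api_endpoint="https://bigquery.googleapis.com",
    location="US",
)
```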
```python
import copy
from google.api_core import client_info
from google.cloud import bigquery
import IPython  # type: ignore
from bigquery_magics import environment
import bigquery_magics.config
import bigquery_magics.version
```
PEP-8:
Imports should be grouped in the following order:
- Standard library imports.
- Related third party imports.
- Local application/library specific imports.
You should put a blank line between each group of imports.
https://peps.python.org/pep-0008/#imports
In this case:
```diff
 import copy
+
 from google.api_core import client_info
 from google.cloud import bigquery
 import IPython  # type: ignore
+
 from bigquery_magics import environment
 import bigquery_magics.config
 import bigquery_magics.version
```
```python
MAX_GRAPH_VISUALIZATION_SIZE = 5000000
MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE = 100000
```
Nit: I find it helpful to group the 0s in Python to better understand the scale at a glance. Also, "size" is pretty ambiguous; please rename the constants to include the units, for example a `_BYTES` suffix.
```diff
-MAX_GRAPH_VISUALIZATION_SIZE = 5000000
-MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE = 100000
+MAX_GRAPH_VISUALIZATION_BYTES = 5_000_000
+MAX_GRAPH_VISUALIZATION_QUERY_RESULT_BYTES = 100_000
```
```python
def _estimate_json_size(df: pandas.DataFrame) -> int:
    """Approximates the length of df.to_json(orient='records')
```
I know it's not a perfect estimate, but pandas provides `DataFrame.memory_usage` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html). Could we use that as the starting point instead? How accurate do you need to be?
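A minimal sketch of that alternative, assuming in-memory bytes are an acceptable proxy for serialized JSON length; the function body is illustrative, not this PR's implementation:

```python
import pandas


def _estimate_json_size(df: pandas.DataFrame) -> int:
    """Roughly estimate the size of df.to_json(orient="records") in bytes.

    memory_usage(deep=True) counts the actual payload of object (string)
    columns instead of just pointer sizes; index=False excludes the index,
    which orient="records" does not serialize anyway.
    """
    return int(df.memory_usage(index=False, deep=True).sum())
```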
```diff
@@ -0,0 +1,73 @@
+# Copyright 2024 Google LLC
```
```diff
-# Copyright 2024 Google LLC
+# Copyright 2026 Google LLC
```